A 20 MHz CMOS reorder buffer for a superscalar microprocessor
Superscalar processors can achieve increased performance by issuing instructions out-of-order from the original sequential instruction stream. Implementing an out-of-order instruction issue policy requires a hardware mechanism to prevent incorrectly executed instructions from updating register values. A reorder buffer can be used to allow a superscalar processor to issue instructions out-of-order and maintain program correctness. This paper describes the design and implementation of a 20MHz CMOS reorder buffer for superscalar processors. The reorder buffer is designed to accept and retire two instructions per cycle. A full-custom layout in 1.2 micron has been implemented, measuring 1.1058 mm by 1.3542 mm
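The in-order retirement idea in this abstract can be illustrated with a small software model. This is only a sketch under stated assumptions: the class name `ReorderBuffer`, the capacity of 8 entries, and the dictionary-based entries are hypothetical and are not taken from the paper, which describes a full-custom CMOS circuit. The only detail borrowed from the abstract is the width of two instructions accepted and retired per cycle.

```python
from collections import deque


class ReorderBuffer:
    """Toy model: results may arrive out of order, registers update in order."""

    def __init__(self, capacity=8, width=2):
        self.capacity = capacity      # hypothetical number of entries
        self.width = width            # instructions accepted/retired per cycle
        self.entries = deque()        # oldest instruction sits at the left
        self.registers = {}           # architectural register file
        self.next_tag = 0

    def dispatch(self, dest_regs):
        """Allocate entries in program order; return the tags that were accepted."""
        tags = []
        for dest in dest_regs[: self.width]:
            if len(self.entries) >= self.capacity:
                break                 # buffer full: the remaining instructions stall
            self.entries.append(
                {"tag": self.next_tag, "dest": dest, "value": None, "done": False}
            )
            tags.append(self.next_tag)
            self.next_tag += 1
        return tags

    def complete(self, tag, value):
        """Record a result, possibly out of program order."""
        for entry in self.entries:
            if entry["tag"] == tag:
                entry["value"] = value
                entry["done"] = True
                return
        raise KeyError(tag)

    def retire(self):
        """Commit up to `width` finished instructions from the head, in order."""
        retired = []
        while self.entries and self.entries[0]["done"] and len(retired) < self.width:
            entry = self.entries.popleft()
            self.registers[entry["dest"]] = entry["value"]
            retired.append(entry["tag"])
        return retired
```

In this model a younger instruction that finishes first stays in the buffer until every older instruction has completed, so the register file never sees a result out of program order.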
Energy and performance-aware application mapping for inhomogeneous 3D networks-on-chip
Three dimensional Networks-on-Chip (3D NoCs) have evolved as an ideal solution to the communication demands and complexity of future high density many core architectures. However, the design practicality of 3D NoCs faces several challenges such as thermal issues, high power consumption and area overhead of 3D routers as well as high complexity and cost of vertical link implementation. To mitigate the performance and manufacturing cost of 3D NoCs, inhomogeneous architectures have emerged to combine 2D and 3D routers in 3D NoCs producing lower area and energy consumption while maintaining the performance of homogeneous 3D NoCs. Due to the limited number of vertical links, application mapping on inhomogeneous 3D NoCs can be complex. However, application mapping has a great impact on the performance and energy consumption of NoCs. This paper presents an energy and performance aware application mapping algorithm for inhomogeneous 3D NoCs. The algorithm has been evaluated with various realistic traffic patterns and compared with existing mapping algorithms. Experimental results show NoCs mapped with the proposed algorithm have lower energy consumption and significant reduction in packet delays compared to the existing algorithms and comparable average packet latency with Branch-and-Bound
Hybrid U-Net: Semantic Segmentation of High-Resolution Satellite Images to Detect War Destruction
Destruction caused by violent conflicts plays a big role in understanding the dynamics and consequences of conflicts, which is now the focus of a large body of ongoing literature in economics and political science. However, existing data on conflict largely come from news or eyewitness reports, which makes them incomplete, potentially unreliable, and biased for ongoing conflicts. Using satellite images and deep learning techniques, we can automatically extract objective information on violent events. To automate this process, we created a dataset of high-resolution satellite images of Syria and manually annotated the destroyed areas pixel-wise. Then, we used this dataset to train and test semantic segmentation networks to detect building damage of various sizes. We specifically utilized a U-Net model for this task due to its promising performance on small and imbalanced datasets. However, the raw U-Net architecture does not fully exploit multi-scale feature maps, which are among the important factors for generating fine-grained segmentation maps, especially for high-resolution images. To address this deficiency, we propose a multi-scale feature fusion approach and design a multi-scale skip-connected Hybrid U-Net for segmenting high-resolution satellite images. In our experiments, U-Net and its variants demonstrated promising segmentation results in detecting various kinds of war-related building destruction. In addition, Hybrid U-Net resulted in significant improvement in segmentation performance compared to U-Net and other baselines. In particular, the mean intersection over union and mean dice score improved by 7.05% and 8.09%, respectively, compared to those of the raw U-Net.
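The two metrics quoted above have standard definitions for binary masks: intersection over union is |A ∩ B| / |A ∪ B| and the Dice score is 2|A ∩ B| / (|A| + |B|). The sketch below computes both for flat 0/1 masks and averages them over a list of mask pairs. The function names are hypothetical, and the abstract does not say whether the paper averages over classes or over images, so the averaging here is an assumption.

```python
def iou_and_dice(pred, target):
    """Return (IoU, Dice) for two equal-length binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - intersection
    if union == 0:
        # Assumption: two empty masks count as a perfect match.
        return 1.0, 1.0
    iou = intersection / union
    dice = 2 * intersection / (pred_area + target_area)
    return iou, dice


def mean_scores(pairs):
    """Average IoU and Dice over (pred, target) mask pairs."""
    scores = [iou_and_dice(pred, target) for pred, target in pairs]
    n = len(scores)
    return sum(s[0] for s in scores) / n, sum(s[1] for s in scores) / n
```

For a single pair of masks the two scores are related by Dice = 2·IoU / (1 + IoU), so Dice is never lower than IoU.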
The Effects of Approximate Multiplication on Convolutional Neural Networks
This paper analyzes the effects of approximate multiplication when performing
inferences on deep convolutional neural networks (CNNs). The approximate
multiplication can reduce the cost of the underlying circuits so that CNN
inferences can be performed more efficiently in hardware accelerators. The
study identifies the critical factors in the convolution, fully-connected, and
batch normalization layers that allow more accurate CNN predictions despite the
errors from approximate multiplication. The same factors also provide an
arithmetic explanation of why bfloat16 multiplication performs well on CNNs.
The experiments are performed with recognized network architectures to show
that the approximate multipliers can produce predictions that are nearly as
accurate as the FP32 references, without additional training. For example, the
ResNet and Inception-v4 models with Mitch-6 multiplication produce Top-5
errors that are within 0.2% compared to the FP32 references. A brief cost
comparison of Mitch-6 against bfloat16 is presented, where a MAC operation
saves up to 80% of energy compared to the bfloat16 arithmetic. The most
far-reaching contribution of this paper is the analytical justification that
multiplications can be approximated while additions need to be exact in CNN MAC
operations.
Comment: 12 pages, 11 figures, 4 tables, accepted for publication in the IEEE
Transactions on Emerging Topics in Computin
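As a rough illustration of what approximate multiplication can look like, the sketch below implements the classic Mitchell logarithmic approximation, which replaces a multiplication with additions of approximate base-2 logarithms. This rests on an assumption: the abstract does not define Mitch-6, and the code is not the paper's multiplier. The function name `mitchell_multiply` and the use of Python floats for the fractional parts are illustrative choices only.

```python
def mitchell_multiply(a: int, b: int) -> float:
    """Approximate a * b using Mitchell's piecewise-linear log2 approximation."""
    if a == 0 or b == 0:
        return 0.0
    sign = -1.0 if (a < 0) != (b < 0) else 1.0
    a, b = abs(a), abs(b)
    # Write each operand as 2**k * (1 + f) with 0 <= f < 1.
    k1 = a.bit_length() - 1
    k2 = b.bit_length() - 1
    f1 = a / (1 << k1) - 1.0
    f2 = b / (1 << k2) - 1.0
    # log2(x) is approximated by k + f, so the logs are simply added.
    frac_sum = f1 + f2
    if frac_sum < 1.0:
        result = (1 << (k1 + k2)) * (1.0 + frac_sum)
    else:
        # The fractional parts carry into the integer part of the log.
        result = (1 << (k1 + k2 + 1)) * frac_sum
    return sign * result
```

The approximation is exact when both operands are powers of two and otherwise underestimates the product, with a worst-case relative error of about 11.1% (for example, `mitchell_multiply(3, 3)` gives 8.0 instead of 9).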
Mapping and Scheduling in Heterogeneous NoC through Population-Based Incremental Learning
Network-on-Chip (NoC) is a growing and promising communication paradigm
for Multiprocessor-System-On-Chip (MPSoC) design, because of its scalability
and performance features. In designing such systems, mapping and scheduling are becoming
critical stages because of the increase in both network size and application
complexity. Some reported solutions solve each issue independently. However,
a joint approach to mapping and scheduling makes it possible to take into account
both computation and communication objectives simultaneously. This paper shows a
mapping and scheduling solution, which is based on a Population-Based Incremental
Learning (PBIL) algorithm. The simulation results suggest that our PBIL approach
is able to find optimal mapping and scheduling, in a multi-objective fashion. A 2-D
heterogeneous mesh was used as target architecture for implementation, although the
PBIL representation is suited to deal with more complex architectures, such as 3-D
meshes
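The basic PBIL loop can be sketched as follows: keep a probability vector, sample a population from it, and shift the vector toward the best sample. This is a minimal sketch of the generic binary PBIL algorithm, not the paper's method. The abstract does not describe how mappings and schedules are encoded, so the example uses a toy objective (the number of 1 bits) as a stand-in, and the function name, population size, learning rate, and seed are all assumed values.

```python
import random


def pbil(fitness, n_bits, pop_size=30, generations=60, learning_rate=0.1, seed=0):
    """Maximize `fitness` over bit lists with basic PBIL (no mutation step)."""
    rng = random.Random(seed)
    probs = [0.5] * n_bits                      # start with unbiased bits
    best, best_fit = None, float("-inf")
    for _ in range(generations):
        # Sample a population from the current probability vector.
        population = [
            [1 if rng.random() < p else 0 for p in probs]
            for _ in range(pop_size)
        ]
        leader = max(population, key=fitness)
        leader_fit = fitness(leader)
        if leader_fit > best_fit:
            best, best_fit = leader, leader_fit
        # Move each probability toward the corresponding bit of the leader.
        probs = [
            (1.0 - learning_rate) * p + learning_rate * bit
            for p, bit in zip(probs, leader)
        ]
    return best, best_fit, probs


if __name__ == "__main__":
    solution, score, _ = pbil(sum, n_bits=16)   # toy objective: count of 1 bits
    print(solution, score)
```

To apply the same loop to mapping and scheduling, the bit list would have to encode a task-to-core assignment and an execution order, and `fitness` would combine the computation and communication objectives; both are left open here because the abstract does not specify them.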
Self-optimized Routing in a Network-on-a-Chip
Many-cores are on the cusp of becoming state-of-the-art processor technology for the next decade. To guarantee efficient communication between multiple cores, a Network-on-a-Chip (NoC) is considered as an alternative to overcome the limitations of the ubiquitous bus technology. In this paper, we present an approach to further improve the routing in an NoC with a self-optimized routing strategy. We extended the routers of a network to measure their load and to send appropriate load information to their direct neighbors. The load information is used to decide in which direction a packet should be routed to avoid hot-spots. Evaluation results show a significant increase in the network throughput. With the self-optimized routing, the NoC is capable of routing up to two times more packets compared to the original routing algorithm proposed b